{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/pos_mtl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Fpos_mtl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1Uf_u7ddMhUt"
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pp-1KqEOOJ4t"
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4M7ka0K5OMWU",
    "outputId": "50ad002e-4363-46cd-8f5d-b6d6aad3e957"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_small_20201223_035557.zip',\n",
       " 'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_base_20201223_201906.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_small_20210111_124159.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_base_20210111_124519.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ERNIE_GRAM_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_ernie_gram_base_aug_20210904_145403.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MT5_SMALL': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_mt5_small_20210228_123458.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_xlm_base_20210602_211620.zip',\n",
       " 'NPCMJ_UD_KYOTO_TOK_POS_CON_BERT_BASE_CHAR_JA': 'https://file.hankcs.com/hanlp/mtl/npcmj_ud_kyoto_tok_pos_ner_dep_con_srl_bert_base_char_ja_20210914_133742.zip'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.mtl.ALL # MTL多任务，具体任务见模型名称，语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BMW528wGNulM"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存。自然语言处理分为许多任务，分词只是最初级的一个。与其每个任务单独创建一个模型，不如利用HanLP的联合模型一次性完成多个任务："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "0tmKBu7sNAXX"
   },
   "outputs": [],
   "source": [
    "HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "elA_UyssOut_"
   },
   "source": [
    "## 词性标注\n",
    "任务越少，速度越快。如指定仅执行词性标注，默认CTB标准："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 70
    },
    "id": "BqEmDMGGOtk3",
    "outputId": "5ad7fd22-651a-4403-d897-a9492eb15854"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">HanLP/NR&nbsp;为/P&nbsp;生产/NN&nbsp;环境/NN&nbsp;带来/VV&nbsp;次/JJ&nbsp;世代/NN&nbsp;最/AD&nbsp;先进/JJ&nbsp;的/DEG&nbsp;多语种/NN&nbsp;NLP/NR&nbsp;技术/NN&nbsp;。/PU</pre></div><br><div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">我/PN&nbsp;的/DEG&nbsp;希望/NN&nbsp;是/VC&nbsp;希望/VV&nbsp;张晚霞/NR&nbsp;的/DEG&nbsp;背影/NN&nbsp;被/LB&nbsp;晚霞/NN&nbsp;映红/VV&nbsp;。/PU</pre></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "HanLP(['HanLP为生产环境带来次世代最先进的多语种NLP技术。', '我的希望是希望张晚霞的背影被晚霞映红。'], tasks='pos').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jj1Jk-2sPHYx"
   },
   "source": [
    "注意上面两个“希望”的词性各不相同，一个是名词另一个是动词。\n",
    "执行PKU词性标注："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 70
    },
    "id": "1goEC7znPNkI",
    "outputId": "586afd5d-db0d-41bd-f7de-411f37062a8c"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">HanLP/nx&nbsp;为/p&nbsp;生产/vn&nbsp;环境/n&nbsp;带来/v&nbsp;次/b&nbsp;世代/n&nbsp;最/d&nbsp;先进/a&nbsp;的/u&nbsp;多语种/n&nbsp;NLP/nx&nbsp;技术/n&nbsp;。/w</pre></div><br><div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">我/r&nbsp;的/u&nbsp;希望/n&nbsp;是/v&nbsp;希望/v&nbsp;张晚霞/nr&nbsp;的/u&nbsp;背影/n&nbsp;被/p&nbsp;晚霞/n&nbsp;映红/v&nbsp;。/w</pre></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "HanLP(['HanLP为生产环境带来次世代最先进的多语种NLP技术。', '我的希望是希望张晚霞的背影被晚霞映红。'], tasks='pos/pku').pretty_print()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wxctCigrTKu-"
   },
   "source": [
    "同时执行所有标准的词性标注："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Zo08uquCTFSk",
    "outputId": "d2b3eb65-06e6-47a6-d954-04cae27d6c51"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"tok/fine\": [\n",
      "    [\"HanLP\", \"为\", \"生产\", \"环境\", \"带来\", \"次\", \"世代\", \"最\", \"先进\", \"的\", \"多语种\", \"NLP\", \"技术\", \"。\"],\n",
      "    [\"我\", \"的\", \"希望\", \"是\", \"希望\", \"张晚霞\", \"的\", \"背影\", \"被\", \"晚霞\", \"映红\", \"。\"]\n",
      "  ],\n",
      "  \"pos/ctb\": [\n",
      "    [\"NR\", \"P\", \"NN\", \"NN\", \"VV\", \"JJ\", \"NN\", \"AD\", \"JJ\", \"DEG\", \"NN\", \"NR\", \"NN\", \"PU\"],\n",
      "    [\"PN\", \"DEG\", \"NN\", \"VC\", \"VV\", \"NR\", \"DEG\", \"NN\", \"LB\", \"NN\", \"VV\", \"PU\"]\n",
      "  ],\n",
      "  \"pos/pku\": [\n",
      "    [\"nx\", \"p\", \"vn\", \"n\", \"v\", \"b\", \"n\", \"d\", \"a\", \"u\", \"n\", \"nx\", \"n\", \"w\"],\n",
      "    [\"r\", \"u\", \"n\", \"v\", \"v\", \"nr\", \"u\", \"n\", \"p\", \"n\", \"v\", \"w\"]\n",
      "  ],\n",
      "  \"pos/863\": [\n",
      "    [\"w\", \"p\", \"v\", \"n\", \"v\", \"a\", \"nt\", \"d\", \"a\", \"u\", \"n\", \"ws\", \"n\", \"w\"],\n",
      "    [\"r\", \"u\", \"n\", \"vl\", \"v\", \"nh\", \"u\", \"n\", \"p\", \"n\", \"v\", \"w\"]\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(HanLP(['HanLP为生产环境带来次世代最先进的多语种NLP技术。', '我的希望是希望张晚霞的背影被晚霞映红。'], tasks='pos*'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以`pos`开头的字段为词性，以`tok`开头的第一个数组为单词，两者按下标一一对应。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 注意\n",
    "Native API的输入单位限定为句子，需使用[多语种分句模型](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/sent_split.py)或[基于规则的分句函数](https://github.com/hankcs/HanLP/blob/master/hanlp/utils/rules.py#L19)先行分句。RESTful同时支持全文、句子、已分词的句子。除此之外，RESTful和native两种API的语义设计完全一致，用户可以无缝互换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "suUL042zPpLj"
   },
   "source": [
    "## 自定义词典\n",
    "自定义词典为词性标注任务的成员变量，要操作自定义词典，先获取一个词性标注任务，以CTB标准为例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "AzYShIssP6kq",
    "outputId": "640cefa5-1d6d-464b-81d2-83c66e2081f2"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<hanlp.components.mtl.tasks.pos.TransformerTagging at 0x160950910>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pos = HanLP['pos/ctb']\n",
    "pos"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1q4MUpgVQNlu"
   },
   "source": [
    "自定义单个词性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "2zZkH9tRQOoi",
    "outputId": "ed0bb8fe-2e68-4c58-e11e-ff6a0cc69ae4"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">HanLP/state-of-the-art-tool&nbsp;为/P&nbsp;生产/NN&nbsp;环境/NN&nbsp;带来/VV&nbsp;次/JJ&nbsp;世代/NN&nbsp;最/AD&nbsp;先进/JJ&nbsp;的/DEG&nbsp;多语种/NN&nbsp;NLP/NR&nbsp;技术/NN&nbsp;。/PU</pre></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "pos.dict_tags = {'HanLP': 'state-of-the-art-tool'}\n",
    "HanLP(\"HanLP为生产环境带来次世代最先进的多语种NLP技术。\", tasks='pos/ctb').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F-9gAeIVQUFG"
   },
   "source": [
    "根据上下文自定义词性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "F8M8cyBrQduw",
    "outputId": "16ef7f82-50ff-478f-c3ea-8e768b0cea31"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style=\"display: table; line-height: 128%;\"><pre style=\"display: table-cell; font-family: SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier New,monospace; white-space: nowrap;\">我/PN&nbsp;的/补语成分&nbsp;希望/名词&nbsp;是/VC&nbsp;希望/动词&nbsp;张晚霞/NR&nbsp;的/DEG&nbsp;背影/NN&nbsp;被/LB&nbsp;晚霞/NN&nbsp;映红/VV&nbsp;。/PU</pre></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "pos.dict_tags = {('的', '希望'): ('补语成分', '名词'), '希望': '动词'}\n",
    "HanLP(\"我的希望是希望张晚霞的背影被晚霞映红。\", tasks='pos/ctb').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要算法基础才能理解，初学者可参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)。"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "pos_mtl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
